{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to DataFrames\n",
    "**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 23, 2018**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "using DataFrames # load package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Handling missing values\n",
    "\n",
    "A singelton type `Missings.Missing` allows us to deal with missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(missing, Missings.Missing)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing, typeof(missing)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Arrays automatically create an appropriate union type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Union{Int64, Missings.Missing},1}:\n",
       " 1       \n",
       " 2       \n",
       "  missing\n",
       " 3       "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = [1, 2, missing, 3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`ismissing` checks if passed value is missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(false, true, false, Bool[false, false, true, false])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ismissing(1), ismissing(missing), ismissing(x), ismissing.(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can extract the type combined with Missing from a `Union` via\n",
    "\n",
    "(This is useful for arrays!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(Union{Int64, Missings.Missing}, Int64)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eltype(x), Missings.T(eltype(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`missing` comparisons produce `missing`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(missing, missing, missing)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing == missing, missing != missing, missing < missing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is also true when `missing`s are compared with values of other types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(missing, missing, missing)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1 == missing, 1 != missing, 1 < missing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`isequal`, `isless`, and `===` produce results of type `Bool`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(true, true, false, true)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next few examples, we see that many (not all) functions handle `missing`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Missings.Missing,1}:\n",
       " missing\n",
       " missing\n",
       " missing\n",
       " missing"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "map(x -> x(missing), [sin, cos, zero, sqrt]) # part 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5-element Array{Missings.Missing,1}:\n",
       " missing\n",
       " missing\n",
       " missing\n",
       " missing\n",
       " missing"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "map(x -> x(missing, 1), [+, - , *, /, div]) # part 2 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6-element Array{Any,1}:\n",
       " missing                                            \n",
       " missing                                            \n",
       " (missing, missing)                                 \n",
       " missing                                            \n",
       " missing                                            \n",
       " Union{Float64, Missings.Missing}[1.0, 2.0, missing]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "map(x -> x([1,2,missing]), [minimum, maximum, extrema, mean, any, float]) # part 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`skipmissing` returns iterator skipping missing values. We can use `collect` and `skipmissing` to create an array that excludes these missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2-element Array{Int64,1}:\n",
       " 1\n",
       " 2"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "collect(skipmissing([1, missing, 2, missing]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, here we combine `collect` and `Missings.replace` to create an array that replaces all missing values with some value (`NaN` in this case)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Float64,1}:\n",
       "   1.0\n",
       " NaN  \n",
       "   2.0\n",
       " NaN  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "collect(Missings.replace([1.0, missing, 2.0, missing], NaN))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way to do this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Float64,1}:\n",
       "   1.0\n",
       " NaN  \n",
       "   2.0\n",
       " NaN  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coalesce.([1.0, missing, 2.0, missing], NaN)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Caution: `nothing` would also be replaced here (for Julia 0.7 a more sophisticated behavior of `coalesce` that allows to avoid this problem is planned)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Float64,1}:\n",
       "   1.0\n",
       " NaN  \n",
       " NaN  \n",
       " NaN  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coalesce.([1.0, missing, nothing, missing], NaN)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use `recode` if you have homogenous output types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4-element Array{Float64,1}:\n",
       "   1.0\n",
       " NaN  \n",
       "   2.0\n",
       " NaN  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recode([1.0, missing, 2.0, missing], missing=>NaN)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use `unique` or `levels` to get unique values with or without missings, respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(Any[1, missing, 2], [1, 2])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique([1, missing, 2, missing]), levels([1, missing, 2, missing])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this next example, we convert `x` to `y` with `allowmissing`, where `y` has a type that accepts missings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3-element Array{Union{Int64, Missings.Missing},1}:\n",
       " 1\n",
       " 2\n",
       " 3"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = [1,2,3]\n",
    "y = allowmissing(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we convert back with `disallowmissing`. This would fail if `y` contained missing values!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "([1, 2, 3], Union{Int64, Missings.Missing}[1, 2, 3], [1, 2, 3])"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "z = disallowmissing(y)\n",
    "x,y,z"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this next example, we show that the type of each column in `x` is initially `Int64`. After using `allowmissing!` to accept missing values in columns 1 and 3, the types of those columns become `Union`s of `Int64` and `Missings.Missing`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Before: Type[Int64, Int64, Int64]\n",
      "After: Type[Union{Int64, Missings.Missing}, Int64, Union{Int64, Missings.Missing}]\n"
     ]
    }
   ],
   "source": [
    "x = DataFrame(Int, 2, 3)\n",
    "println(\"Before: \", eltypes(x))\n",
    "allowmissing!(x, 1) # make first column accept missings\n",
    "allowmissing!(x, :x3) # make :x3 column accept missings\n",
    "println(\"After: \", eltypes(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this next example, we'll use `completecases` to find all the rows of a `DataFrame` that have complete data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4×2 DataFrames.DataFrame\n",
      "│ Row │ A       │ B       │\n",
      "├─────┼─────────┼─────────┤\n",
      "│ 1   │ 1       │ A       │\n",
      "│ 2   │ \u001b[90mmissing\u001b[39m │ B       │\n",
      "│ 3   │ 3       │ \u001b[90mmissing\u001b[39m │\n",
      "│ 4   │ 4       │ C       │\n",
      "Complete cases:\n",
      "Bool[true, false, false, true]\n"
     ]
    }
   ],
   "source": [
    "x = DataFrame(A=[1, missing, 3, 4], B=[\"A\", \"B\", missing, \"C\"])\n",
    "println(x)\n",
    "println(\"Complete cases:\\n\", completecases(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use `dropmissing` or `dropmissing!` to remove the rows with incomplete data from a `DataFrame` and either create a new `DataFrame` or mutate the original in-place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2-element Array{DataFrames.DataFrame,1}:\n",
       " 2×2 DataFrames.DataFrame\n",
       "│ Row │ A │ B │\n",
       "├─────┼───┼───┤\n",
       "│ 1   │ 1 │ A │\n",
       "│ 2   │ 4 │ C │\n",
       " 2×2 DataFrames.DataFrame\n",
       "│ Row │ A │ B │\n",
       "├─────┼───┼───┤\n",
       "│ 1   │ 1 │ A │\n",
       "│ 2   │ 4 │ C │"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y = dropmissing(x)\n",
    "dropmissing!(x)\n",
    "[x, y]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we call `showcols` on a `DataFrame` with dropped missing values, the columns still allow missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2×2 DataFrames.DataFrame\n",
      "│ Col # │ Name │ Eltype                          │ Missing │ Values  │\n",
      "├───────┼──────┼─────────────────────────────────┼─────────┼─────────┤\n",
      "│ 1     │ A    │ Union{Int64, Missings.Missing}  │ 0       │ 1  …  4 │\n",
      "│ 2     │ B    │ Union{Missings.Missing, String} │ 0       │ A  …  C │"
     ]
    }
   ],
   "source": [
    "showcols(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we've excluded missing values, we can safely use `disallowmissing!` so that the columns will no longer accept missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2×2 DataFrames.DataFrame\n",
      "│ Col # │ Name │ Eltype │ Missing │ Values  │\n",
      "├───────┼──────┼────────┼─────────┼─────────┤\n",
      "│ 1     │ A    │ Int64  │ 0       │ 1  …  4 │\n",
      "│ 2     │ B    │ String │ 0       │ A  …  C │"
     ]
    }
   ],
   "source": [
    "disallowmissing!(x)\n",
    "showcols(x)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 0.6.2",
   "language": "julia",
   "name": "julia-0.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
